
SN SYSTEMS 

Sony Computer Entertainment Group 


Under the Compiler's Hood: Supercharge Your PLAYSTATION®3 (PS3™) Code.


Understanding your compiler is the key to success in the gaming world. 





Supercharge your PS3 game code 


• Part 1: Compiler internals. 


• Part 2: How to write efficient C/C++ code. 
Part 1: Compiler internals


• Trees & parsing 

• Basic blocks 

• Data flow analysis 

• Alias analysis 

• Invariant code motion

• Load/Store elimination 

• Copy and constant propagation 

• Scheduling 

• Register allocation 

• Profile driven optimisations 

• Plus, for register allocation: "live ranges, not scope, are important"
Trees & parsing

void f( int *p )
{
    for( int i = 0; i < 6; ++i )
    {
        p[ i ] = 0;
    }
}


As a first step the text of the file is converted into a tree structure.

TREETOP
    FUNC_HEADER
        FUNCTION
            _QlfPi
            p
    EXEC_STMT
        BLOCK
            i = 14:0
            WHILE
                i < 14:6
                BLOCK
                    *(CAST(type_44,p))[i] = 14:0
                    i = i + 14:1
            return NIL
Trees & parsing

• Inlining done by merging trees

• Constant folding

    a + 1 + 2
    -> a + 3

• If conversion

    if( x == 0 ) y = a; else y = b;
    -> y = ( x == 0 ) ? a : b;
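The two tree rewrites above can be written out by hand to check that they preserve behaviour. This is only a sketch: the function names are made up for illustration, and a real compiler applies these rewrites to its tree form rather than to source text.

```cpp
// Constant folding: the two literals are combined at compile time.
int fold_before(int a) { return a + 1 + 2; }
int fold_after(int a)  { return a + 3; }

// If conversion: the branch becomes a single select expression.
int select_before(int x, int a, int b)
{
    int y;
    if (x == 0) y = a; else y = b;
    return y;
}

int select_after(int x, int a, int b)
{
    return (x == 0) ? a : b;
}
```

Both pairs return the same value for every input; the second form of each gives the later passes less work and fewer branches.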
Basic blocks




Next, the tree is broken into basic blocks.

A basic block is a straight-line section of code that contains no branches or labels: control enters only at the top and leaves only at the bottom.


BB:1
    i = 0
  * IF ( i < 6 ) ? goto BB:2 else goto BB:3
BB:2
    tmp2 = 0
    tmp3 = impy ( i, 4 )
    tmp4 = copy4s ( tmp3 )
    store ( tmp2, p, tmp4 )
    tmp5 = iadd ( i, 1 )
    tmp5 = copy4s ( tmp5 )
    i = tmp5
  * IF ( i < 6 ) ? goto BB:2 else goto BB:3
BB:3
  * RETURN
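As a rough illustration, the basic block listing above corresponds to the following hand-lowered C++, with one label per block. The name f_lowered and the labels bb2 and bb3 are hypothetical; a compiler performs this step on its intermediate form, not on source-level gotos.

```cpp
// Hand-lowered equivalent of the loop in f(): each label starts a basic block.
void f_lowered(int *p)
{
    int i = 0;                              // BB:1
    if (i < 6) goto bb2; else goto bb3;
bb2:                                        // BB:2, the loop body
    p[i] = 0;
    i = i + 1;
    if (i < 6) goto bb2; else goto bb3;
bb3:                                        // BB:3
    return;
}
```

Each block ends in exactly one branch or return, which is what lets later passes treat the block body as straight-line code.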
Basic blocks

• Assignments translated to loads and stores

• if, while, switch etc. converted to basic block boundaries

Example:

int f( int *p )
{
    *p = 1;           // store
    int a = *p;       // load
    if( a == 1 )      // condition
    {
        p++;          // expression
    }
    return a * 2;     // return expression
}
Basic blocks: Unrolling

// 3 basic blocks
void f( int *p )
{
    for( int i = 0; i != 3; ++i )
    {
        p[ i ] = i;
    }
}

// 1 basic block
void f( int *p )
{
    p[ 0 ] = 0;
    p[ 1 ] = 1;
    p[ 2 ] = 2;
}
Data flow analysis

Example:

void f( int *p )
{
    for( int i = 0; i < 6; ++i )
    {
        p[ i ] = 0;
    }
}

[Figure: data flow graph of the loop. Nodes and their operands: i = 0; ildlit 0; impy t96, 4; copy4s t99; iadd t96, 1; copy4s t103; store t98, p, t100; i = t104]
Alias analysis

• Tests to see if loads, stores and calls interfere with each other.

• Enables the reordering of loads and stores.

• Enables the elimination of redundant loads and stores.

• Controllable using the restrict keyword.

Example:

void g( int *x );

void f( short *p, int *d )
{
    d[ 0 ] = 1;
    int a = *p;   // *p (short) does not alias d[ n ] (int): different types
    d[ 1 ] = 2;
    int x = 2;    // d[ n ] does not alias x (formal vs stack)
    d[ 2 ] = a;
    g( &x );      // call, x may have been modified
    d[ 3 ] = x;
}
Invariant code motion

• Moves as much code as possible out of loops

• Fewer instructions in loops

• Dependent on aliasing!

Example:

void f( int a, int b, int *p, short *q )
{
    for( unsigned i = 0; i != 100; ++i )
    {
        // load from q doesn't alias p,
        // so we can move it to before the loop.
        p[ i*2 ] = q[ 0 ] + a;

        // a + b is invariant, we can move it out of the loop.
        p[ i*2 + 1 ] = a + b;

        // store to q[ 1 ] is invariant,
        // we can move it to after the loop.
        q[ 1 ] = a;
    }
}
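For comparison, here is roughly what the loop looks like once the three invariant operations have been moved by hand. The names f_original and f_hoisted are hypothetical, and the sketch assumes, as the slide does, that p and q never overlap.

```cpp
// The loop as written: three invariant operations repeated 100 times.
void f_original(int a, int b, int *p, short *q)
{
    for (unsigned i = 0; i != 100; ++i)
    {
        p[i * 2] = q[0] + a;
        p[i * 2 + 1] = a + b;
        q[1] = static_cast<short>(a);
    }
}

// The same loop after invariant code motion, done by hand.
// Only valid because stores through p (int) cannot change q (short).
void f_hoisted(int a, int b, int *p, short *q)
{
    const int q0_plus_a = q[0] + a;   // load and add moved before the loop
    const int a_plus_b = a + b;       // invariant expression moved before the loop
    for (unsigned i = 0; i != 100; ++i)
    {
        p[i * 2] = q0_plus_a;
        p[i * 2 + 1] = a_plus_b;
    }
    q[1] = static_cast<short>(a);     // invariant store moved after the loop
}
```

If p and q could overlap, the load of q[ 0 ] would have to stay inside the loop, which is why this optimisation depends on alias analysis.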
Copy and constant propagation

• Combine assignments and expressions

• Uses fewer instructions

Example:

void f( int i, int *p )
{
    // copy propagation, all the same variable
    int a = i;
    int b = a;
    int c = b;

    // constant propagation
    p[ c++ ] = 0; // -> c + 0
    p[ c++ ] = 1; // -> c + 1
    p[ c++ ] = 2; // -> c + 2
}
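A sketch of what is left once the copies and the increments have been propagated away, assuming the three stores write the constants 0, 1 and 2. The function names are hypothetical and only serve to compare the two forms.

```cpp
// Before: three copies of i and a running index.
void f_before(int i, int *p)
{
    int a = i;
    int b = a;
    int c = b;
    p[c++] = 0;
    p[c++] = 1;
    p[c++] = 2;
}

// After copy propagation (a, b and c are all i) and constant propagation
// (the increments fold into fixed offsets from i).
void f_after(int i, int *p)
{
    p[i] = 0;
    p[i + 1] = 1;
    p[i + 2] = 2;
}
```

The propagated form needs no temporaries and no index updates, so the three stores can be scheduled freely.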
Scheduling

• Re-order instructions to avoid stalls

• FPU/VMX operations take many cycles to complete

• Bad aliasing prevents efficient scheduling

Example:

float f( float *p, float a )
{
    return a * a * 2.0f +
           a * 1.5f + 1.3f;
}
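To make the dependencies in the example visible, the expression can be split into named temporaries. This is only an illustration: the names f_expr and f_steps are hypothetical, and the unused pointer parameter is dropped.

```cpp
// The expression from the example, as written.
float f_expr(float a)
{
    return a * a * 2.0f + a * 1.5f + 1.3f;
}

// The same evaluation order with every intermediate result named.
float f_steps(float a)
{
    const float t0 = a * a;      // step 1 of the longest dependency chain
    const float t1 = a * 1.5f;   // independent of t0: can issue while t0 is in flight
    const float t2 = t0 * 2.0f;  // step 2, waits for t0
    const float t3 = t2 + t1;    // step 3, waits for t2 and t1
    return t3 + 1.3f;            // step 4, waits for t3
}
```

The chain t0, t2, t3 and the final add is the longest sequence of dependent operations; the scheduler's job is to fill the waiting cycles along it with independent work such as t1.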
Critical path
Register allocation

• Expressions are allocated a "local" or "global" register in the function.

• Global registers are usually in short supply.

• Using too many registers leads to "spills" to memory.

• Taking the address of a variable also forces it into memory.

Example:

// x is in a global register: its live range crosses the if
x = a;
if( cond ) {/*...*/}
y = x + 1;

// x is in a local register: its live range stays inside one basic block
if( cond ) {/*...*/}
x = a;
y = x + 1;
Profile driven optimisations

• Use results of profiling to determine "hot" and "cold" code

• Hot code gets more instructions

- Inlining

- Loop unrolling

• Cold code gets fewer instructions

- Moved away from hot code to prevent icache pollution

• On GCC, use the "gcov" tool
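Where no profile data is available, GCC and Clang accept manual hints that express the same hot/cold split. The sketch below uses the __builtin_expect builtin and the cold attribute; the functions handle_error and process are hypothetical, and hand-written hints are a stand-in for real profile data, not a replacement.

```cpp
// Hypothetical error path: marked cold so the compiler optimises it for size
// and keeps it away from the hot code. noinline stops it being merged back in.
__attribute__((cold, noinline)) int handle_error(int code)
{
    return -code;
}

int process(int value)
{
    // Hint that the error branch is rarely taken.
    if (__builtin_expect(value < 0, 0))
    {
        return handle_error(value);
    }
    return value * 2;   // hot path, laid out as the fall-through
}
```

With profile feedback the compiler derives the same information from measured branch counts, so hints like these are mainly useful for paths a profiling run does not exercise.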
Part 2: How to write efficient C/C++ code

• Maximising basic block sizes

• Minimising effects of latency

• Avoiding aliasing

• Type conversions and unions

• PS3 intrinsics vs. inline assembler

• Vector classes dos and don'ts

• Multithreading effects on PS3

• Virtual function calls and switches

• Console vs. PC programming

• Using SN Systems tools to examine your code

• New SNC optimizations

• SnMathLib
Maximising basic block sizes

• Inline everything you can and use fewer, larger modules

• Use __attribute__( ( always_inline ) ) on small functions

• Be aware of the high latency on floating point compares on PPU

• Even predicted branches are slow on deeply pipelined processors


void bad( bool x )
{
    if( x )
    {
        // do something
    }
    // do something
    if( x )
    {
        // do something
    }
}

void good( bool x )
{
    if( x )
    {
        // do something
        // do something
        // do something
    }
    else
    {
        // do something (the unconditional step)
    }
}
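A minimal sketch of forced inlining with the GCC/Clang attribute spelling. The helper lerp and the caller blend3 are hypothetical names chosen for the example.

```cpp
// Small helper that should never cost a call: always_inline asks the
// compiler to inline it even when optimisation is otherwise disabled.
static inline __attribute__((always_inline)) float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

// After inlining, both lerp bodies become part of this function's single
// basic block, so their instructions can be scheduled together.
float blend3(float a, float b, float c, float t)
{
    return lerp(lerp(a, b, t), c, t);
}
```

Without inlining, each call ends a basic block and the compiler cannot overlap the two interpolations.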
Minimising effects of latency

• Interleave similar expressions in same basic block

    a0 = b0 * 3.14f + c0 * 1.257f
    a1 = b1 * 3.14f + c1 * 1.257f
    a2 = b2 * 3.14f + c2 * 1.257f

• Use two threads on the PPU

• Load-hit-store on modern processor cores

    a[ 10 ] = b;
    c = a[ 10 ]; // same address

• Floating point compare

    // try to make this block very big
    if( fabsf( x ) < epsilon ) {}

• Simplify && and || expressions

    if( p[0] == 0 && p[1] == 0 )

        lwz
        cmp
        bc      # 2-22 cycles
        lwz
        cmp
        bc      # 2-22 cycles

    int p1 = p[ 1 ];
    if( p[0] == 0 && p1 == 0 )

        lwz
        lwz
        cmp
        cmp
        crand
        bc      # 2-22 cycles
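The && simplification can be sketched in portable C++ as follows. The function names are hypothetical, and the second form is only valid when p[ 1 ] is always safe to read, because it loads both elements unconditionally.

```cpp
// Short-circuit form: the second load sits behind the first branch.
bool both_zero_branchy(const int *p)
{
    return p[0] == 0 && p[1] == 0;
}

// Flattened form: both loads are issued up front and the two comparison
// results are combined with a bitwise AND, leaving a single branch at most.
bool both_zero_flat(const int *p)
{
    const int p0 = p[0];
    const int p1 = p[1];   // unconditional load: p[1] must be readable
    return (p0 == 0) & (p1 == 0);
}
```

Both functions return the same result; the flat version trades one extra load on the early-out path for one fewer branch.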
Avoiding aliasing

• May use the restrict keyword for similar pointer parameters

• Inlining improves visibility of expressions

• Aliasing manifests as

- Seemingly redundant loads and stores

- Bad scheduling

• Move loads to start of basic block and stores to the end

void Butterfly( float *p1, float *p2 )
{
    p1[ 0 ] = p2[ 0 ] + p2[ 1 ];
    p1[ 1 ] = p2[ 0 ] - p2[ 1 ]; // bad, p2[0] and p2[1] must be reloaded
}

void Butterfly( float *p1, float *p2 )
{
    float p20 = p2[ 0 ];
    float p21 = p2[ 1 ];
    p1[ 0 ] = p20 + p21;
    p1[ 1 ] = p20 - p21; // good, no need for reload
}
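The restrict route mentioned in the first bullet might look like the sketch below. Standard C++ has no restrict keyword, so this uses the __restrict extension accepted by GCC, Clang and MSVC; the name ButterflyRestrict is hypothetical.

```cpp
// __restrict promises the compiler that p1 and p2 never overlap, so p2[0]
// and p2[1] can stay in registers across the store to p1[0].
// Calling this with overlapping pointers is undefined behaviour.
void ButterflyRestrict(float *__restrict p1, const float *__restrict p2)
{
    p1[0] = p2[0] + p2[1];
    p1[1] = p2[0] - p2[1];
}
```

Copying the inputs into locals first, as in the second Butterfly above, gives the same effect without relying on an extension or on the caller keeping the promise.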
Type conversions and unions

• Ok to use unions on the stack frame

static inline float f( vector float x )
{
    union { float f[ 4 ]; vector float v; } u;
    u.v = x;
    return u.f[ 1 ];
}

• Bad to use unions in classes - structure copies are ambiguous

struct Naive
{
    union { float f[ 4 ]; vector float v; } u;
};

float f( Naive x )
{
    return x.u.f[ 0 ];
}
PS3 intrinsics vs. inline assembler

• Intrinsics

- schedulable

- alias analysis

- portable

• atomic access

• time base __mftb()

• time-saving machine ops __fctiwz()

• io __eieio()

• debugging __builtin_frame_address()

• system calls system_call_nnn()

• Inline asm

- machine specific

- not schedulable

- more flexible?
Vector classes dos and don’ts

• Use __attribute__( ( always_inline ) )

• Use access functions instead of unions

• Pass by value if and only if class has one data member

• Always use multiples of 16 bytes


• Mixing of float and VMX bad with GCC - use scalar class 

• Do float -> integer conversions straight to memory 

• Use "supervectors" to absorb latency 

- Groups of four or more VMX registers

- Provides work to be done between register dependencies 

- Works over function calls 
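A minimal sketch of a vector class that follows the first four rules. It uses the GCC/Clang vector_size extension as a portable stand-in for the VMX `vector float` type; the class name Vec4, the typedef v4f and the accessor get are assumptions made for this example and are not taken from any SN Systems library.

```cpp
#include <cstring>

// GCC/Clang generic vector type: four floats in one 16-byte register.
typedef float v4f __attribute__((vector_size(16)));

class Vec4
{
public:
    Vec4(float x, float y, float z, float w)
    {
        const float init[4] = {x, y, z, w};
        std::memcpy(&v_, init, sizeof v_);
    }

    explicit Vec4(v4f v) : v_(v) {}

    // Access function: copies through a local array instead of exposing a
    // union member in the class.
    __attribute__((always_inline)) float get(int i) const
    {
        float out[4];
        std::memcpy(out, &v_, sizeof out);
        return out[i];
    }

    // Passed by value: the class has exactly one data member.
    __attribute__((always_inline)) Vec4 operator+(Vec4 rhs) const
    {
        return Vec4(v_ + rhs.v_);
    }

private:
    v4f v_;   // the only data member, 16 bytes
};

static_assert(sizeof(Vec4) == 16, "Vec4 should be exactly one 16-byte vector");
```

Typical use is `Vec4 c = a + b;` followed by `c.get(0)` only where a scalar is really needed, since every element access moves data out of the vector register.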
Multithreading effects on PS3

• Instructions are executed alternately: effective latency is halved

• Cache misses are covered by other thread

• Use SPUs for any available task

• Synchronization intrinsics have high latency

• Any second thread is better than the default.

Thread 0:   lwz         fadd        stfs
Thread 1:       lwz         fadd        stfs

Actual latency: 2 + 9 + 1

Effective latency: (2 + 9 + 1)/2
Virtual function calls and switches

• Virtual function calls are incredibly useful

- For high level control, AI and menus

• Virtual function calls are evil!

- Very slow: 50+ cycles

- Only use for 100+ instruction functions

• Group values when using switches

switch( a ) // good: jump table
{
    case 1 : ...
    case 2 : ...
}

switch( a ) // bad: branch tree
{
    case 100 : ...
    case 200 : ...
}

- Consider using look-up table before switch to cluster values
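One way to cluster sparse case values is to map them to small consecutive codes through a table before the switch. This is a sketch under the assumption that the input fits in one byte; the enum, the table kDenseCode and the function dispatch are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// Dense codes: consecutive values the compiler can turn into a jump table.
enum Dense : std::uint8_t { kOther = 0, kHundred = 1, kTwoHundred = 2 };

// Build the 256-entry table once: every value maps to kOther except the
// sparse values we care about.
static std::array<std::uint8_t, 256> make_table()
{
    std::array<std::uint8_t, 256> t{};
    t[100] = kHundred;
    t[200] = kTwoHundred;
    return t;
}

static const std::array<std::uint8_t, 256> kDenseCode = make_table();

int dispatch(std::uint8_t a)
{
    switch (kDenseCode[a])          // one table load, then a dense switch
    {
        case kHundred:    return 1;
        case kTwoHundred: return 2;
        default:          return 0;
    }
}
```

The table costs one extra load, so this only pays off when the original switch would otherwise compile to a chain of compares and branches.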
Console vs. PC programming

• PCs 

- have extra hardware to minimize the effects of latency 

- have fewer CPUs 

- cannot use precompiled display lists 

- designed to run legacy code 

- load from hard drives 


• Consoles 

- are sensitive to latency 

- have many CPUs of different kinds 

- use precompiled display lists 

- run new code 

- load from DVD/Blu-ray 
Console vs. PC programming

• Avoid using malloc - use pools instead and pre-allocate

• Pre-build display lists - CPU resource is precious, do not use it for rendering

• Design data structures to be spooled from DVD - do not use "serialize" methods or class factories

• Do not use global variables - global variable access is inefficient and uses data cache badly

• Avoid virtual functions / indirect calls

• Use fewer, larger modules for better interprocedural optimization - about ten to twenty modules is optimal for distributed builds.
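As an illustration of the first bullet, a fixed-size pool can hand out pre-allocated slots without ever calling malloc. This is a minimal, single-threaded sketch; the class name FixedPool and its interface are assumptions made for the example.

```cpp
#include <cstddef>

// Pool of N slots, each big enough for one T. All storage lives inside the
// pool object, so nothing is allocated after construction.
template <typename T, std::size_t N>
class FixedPool
{
public:
    FixedPool() : free_count_(N)
    {
        // Every slot starts on the free stack.
        for (std::size_t i = 0; i < N; ++i)
        {
            free_[i] = storage_ + i * sizeof(T);
        }
    }

    // Returns raw storage for one T, or nullptr when the pool is exhausted.
    void *allocate()
    {
        return free_count_ ? free_[--free_count_] : nullptr;
    }

    // Returns a slot obtained from allocate() to the pool.
    void release(void *slot)
    {
        free_[free_count_++] = slot;
    }

private:
    alignas(T) unsigned char storage_[N * sizeof(T)];
    void *free_[N];             // stack of free slots
    std::size_t free_count_;
};
```

Objects would be constructed in the returned storage with placement new and destroyed explicitly before release; both operations are constant time and never fragment the heap.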
Using SN Systems tools to examine your code

• Debugger 

- Pipeline analyzer 

- Randomly stopping execution can reveal hotspots 

• Binary Utilities 

- Pipeline analyzer 

- Symbols 


• Tuner 

- Look for hotspots - the instruction before the hotspot is the bad one! 

- PC sampling 

- Auto instrumentation of functions 

- User labels 
Using SN Systems tools to examine your code

Sample Tuner PC sampling

volatile char mem;
for( int j = 0; j != 10000; ++j )
{
    mem = ( mem - 1 >> 1 ) + 1;
    mem = ( mem - 2 >> 1 ) + 2;
}

f()
000103F0  38802710  li      r4, 0x2710
000103F4  3C601001  lis     r3, 0x1001
000103F8  7C8903A6  mtspr   CTR, r4
000103FC  88834228  lbz     r4, 0x4228(r3)
00010400  7C840774  extsb   r4, r4
00010404  3084FFFF  addic   r4, r4, -0x1
00010408  7C840E70  srawi   r4, r4, 1
0001040C  30840001  addic   r4, r4, 0x1
00010410  98834228  stb     r4, 0x4228(r3)
00010414  88834228  lbz     r4, 0x4228(r3)
00010418  7C840774  extsb   r4, r4
0001041C  3084FFFE  addic   r4, r4, -0x2
00010420  7C840E70  srawi   r4, r4, 1
00010424  30840002  addic   r4, r4, 0x2
00010428  98834228  stb     r4, 0x4228(r3)
0001042C  4200FFD0  bdnz    0x000103FC
00010430  4E800020  blr

[Figure: Tuner PC-sampling view of the listing above, showing per-instruction sample counts (ranging from 1 to 137785) and pipeline stall columns (PIPE, REG, LSU), with callouts "Instruction causing delay" and "delay"]

LHS within 1 cycle (requires 20 for safety)

Most common hotspots:

• on branch targets (icache miss)

• on loads (dcache miss)

• on loads (Load-Hit-Store)
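The load-hit-store pattern in the sampled loop comes from storing a byte and reading the same address back straight away. When the memory does not have to be observed on every step, the usual fix is to work on a local copy. The sketch below shows both forms; the function names are hypothetical and signed char is used so the arithmetic does not depend on the platform's default char signedness.

```cpp
// Every statement stores to *mem and the next one immediately reloads it:
// the store followed by a load of the same address, as in the sampled loop.
void update_through_memory(volatile signed char *mem, int n)
{
    for (int j = 0; j != n; ++j)
    {
        *mem = static_cast<signed char>(((*mem - 1) >> 1) + 1);
        *mem = static_cast<signed char>(((*mem - 2) >> 1) + 2);
    }
}

// Same arithmetic on a local copy: one load before the loop and one store
// after it, so no load ever has to wait for a pending store.
void update_in_register(signed char *mem, int n)
{
    int m = *mem;
    for (int j = 0; j != n; ++j)
    {
        m = static_cast<signed char>(((m - 1) >> 1) + 1);
        m = static_cast<signed char>(((m - 2) >> 1) + 2);
    }
    *mem = static_cast<signed char>(m);
}
```

The second form is not a drop-in replacement when the location really must be volatile, for example a hardware register, because intermediate values are no longer written out.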
New SNC optimizations

• SSA analysis 

- Constant propagation 

- Memory optimizations 

- Use of VMX to replace int and float operations 

- Auto vectorization 

- Conversion of floating point compares to integer 

- Removal of fixed and zero iteration loops 
SnMathLib

• Worked example of a complete math class for PSP® (PlayStation®Portable), PS3 and PC

• Shows correct construction of math libraries

• Scalar classes for mixed operation

• Includes "Supervector" class "quadquad" for better scheduling

- Four vector operations of same kind at a time 

- Fills in gaps between instruction issues 


• Extensive test suite 

- Performance 

- Accuracy (especially trig functions) 
Essential reading

• Engineering a Compiler (Cooper & Torczon)

• Wikipedia

- http://en.wikipedia.org/wiki/Category:Compilers

• GCC internal documentation

- http://gcc.gnu.org/onlinedocs/gccint/

• An interesting case study

- http://www.flounder.com/optimization.htm